feat(schema): represent, serialize and validate v3 column default values (1/4)#746
Conversation
445f7ae to
4aade00
Compare
1ee5b32 to
34470af
Compare
There was a problem hiding this comment.
Pull request overview
Adds Iceberg v3 column default value support at the schema layer by carrying, JSON-(de)serializing, validating, and projecting initial-default / write-default literals (foundation for later read/write-path work).
Changes:
- Extend
SchemaFieldto storeinitial-default/write-defaultas shared immutable literals and include them in equality/ID-reassignment rebuilds. - Add JSON serde for
initial-default/write-defaultusing existing single-value literal serialization. - Update schema projection to use
FieldProjection::Kind::kDefaultwhen an expected field is missing but hasinitial-default, and add/extend unit tests + v3 metadata fixture.
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| src/iceberg/test/schema_util_test.cc | Adds projection tests for missing fields with initial-default and for ignoring defaults when the field is present. |
| src/iceberg/test/schema_test.cc | Adds schema validation/version-gating and ID-reassignment preservation tests for default values. |
| src/iceberg/test/schema_json_test.cc | Adds round-trip and failure tests for JSON serialization/parsing of default values (incl. nested structs). |
| src/iceberg/test/resources/TableMetadataV3Valid.json | Adds a v3-valid table metadata JSON fixture. |
| src/iceberg/schema.cc | Preserves defaults during ID reassignment and adds default-related validation in Schema::Validate. |
| src/iceberg/schema_util.cc | Projects missing fields with initial-default as kDefault rather than error/null. |
| src/iceberg/schema_field.h | Extends SchemaField API/storage to carry default literals and expose them via optional reference accessors. |
| src/iceberg/schema_field.cc | Implements default accessors, validation of defaults, and includes defaults in equality. |
| src/iceberg/json_serde.cc | Serializes/parses initial-default / write-default on schema fields via literal single-value serde. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
First of a multi-part split of column default value support (apache#730) — the schema foundation the read and evolution paths build on. Purely additive; no read/write behavior change on its own. - SchemaField carries `initial-default` / `write-default` (immutable std::shared_ptr<const Literal>) with copy-preserving WithInitialDefault / WithWriteDefault modifiers; getters return optional<reference_wrapper>. - JSON serde reads/writes `initial-default` / `write-default` via the existing single-value serialization. - Schema::Validate rejects default values below format v3 and validates they are non-null primitive literals matching the field type. - Generic schema projection maps a column missing from a data file with an initial-default to FieldProjection::Kind::kDefault. Read-path application (Parquet/Avro) and schema evolution follow in separate PRs. See apache#731 for the full end-to-end proof-of-concept.
34470af to
f663e0e
Compare
This file was added by this PR but is referenced by no test (table-metadata resources are loaded by explicit name via ReadTableMetadataFromResource) and carries no column defaults, so it is dead. Remove it.
Add SchemaField default-value coverage directly to json_serde_test (the binary that exercises json_serde.cc): one test pins the initial-default / write-default wire format (incl. their absence when unset), and one round-trips a field with both defaults through ToJson -> FieldFromJson for every primitive type, exercising the non-trivial date/timestamp/decimal/uuid/binary encodings the default path reuses.
…-schema # Conflicts: # src/iceberg/test/json_serde_test.cc
wgtmac
left a comment
There was a problem hiding this comment.
Static review notes; I did not build or run tests.
Addresses wgtmac's review feedback: - Enforce UTC offset for timestamptz / timestamptz_ns default values in FieldFromJson. The shared timestamp parser accepts any offset and silently normalizes to UTC, so C++ could accept default metadata Java rejects and rewrite the offset on write. Added TemporalUtils::IsUtcOffset, which reuses the existing timezone-suffix parser. - ValidateDefault now casts the default literal to the field type (Literal::CastTo) instead of requiring an exact type match, matching Java's Types.NestedField (e.g. an int default on a long field). Uncastable or out-of-range defaults are still rejected. - Documented that non-primitive (struct/list/map) defaults remain unsupported, matching Java's current model. - Extended the json_serde round-trip tests with time/timestamptz/timestamp_ns/ timestamptz_ns and a negative case for non-UTC timestamptz defaults; added a schema test for the cast-to-field-type behavior.
Directly cover the UTC/non-UTC/unparseable offset cases the timestamptz default-value enforcement relies on.
The AMD64 Windows job failed in its cleanup step (rm -rf example/build: Device or resource busy) after the example compiled and linked successfully; re-run CI.
wgtmac
left a comment
There was a problem hiding this comment.
Two remaining static review notes; I did not build or run tests.
| // be cast to the field type or fall outside its range (CastTo signals out-of-range as | ||
| // an above-max/below-min sentinel). | ||
| auto field_type = std::static_pointer_cast<PrimitiveType>(field.type()); | ||
| auto cast = value.CastTo(field_type); |
There was a problem hiding this comment.
This validates the cast but keeps storing the original literal. Java stores the result of castDefault, so a programmatic default like Literal::Long(...) on a timestamp field would validate here, then later serialize or project as a long instead of a timestamp default. Either the field needs to retain the casted literal, or this should keep requiring an exact type match.
|
|
||
| Status ValidateDefault(const SchemaField& field, const Literal& value, | ||
| std::string_view kind) { | ||
| if (value.IsNull() || value.IsAboveMax() || value.IsBelowMin()) { |
There was a problem hiding this comment.
This rejects explicit null defaults. The spec allows optional field defaults to be null and says they may be explicitly set. Java mostly models a null default as absence, but C++ now has a present-null Literal path that will fail validation and, for initial-default, also trips the format-version gate by presence rather than by non-null value.
Keep ICEBERG_DECLARE_JSON_SERDE identical to upstream and declare the four schema/metadata-bearing models (LoadTableResult, CreateTableRequest, CommitTableRequest, CommitTableResponse) explicitly, since only their ToJson return type (Result) differs.
…-schema # Conflicts: # src/iceberg/catalog/rest/rest_catalog.cc
…lizers (#785) ## What Change the schema/type/metadata `ToJson` serializers from returning bare `nlohmann::json` to `Result<nlohmann::json>`: - `json_serde`: `ToJson(SchemaField | Type | Schema | TableMetadata | TableUpdate)` - REST: `ToJson(CreateTableRequest | CommitTableRequest | LoadTableResult | CommitTableResponse)` Errors propagate via `ICEBERG_ASSIGN_OR_RAISE`; callers bottom out at the existing `Result`-returning boundaries (`ToJsonString`, `rest_catalog`, `TableMetadataUtil::Write`). ## Why Preparation for v3 column default values (#730 / #746). Single-value (`Literal`) serialization is fallible, and column defaults invoke it from schema serialization, so these serializers need to propagate the error instead of throwing. Splitting it out keeps the feature PR focused on the feature. ## Notes - **No behavior change** — every conversion still succeeds today; only the return type changes. - The shared `ToJsonList` template stays bare (it also serializes infallible types such as partition specs and snapshots); `TableMetadata` serializes its schema list with an explicit loop. - The REST `ICEBERG_DECLARE_JSON_SERDE` macro is unchanged; the four schema/metadata-bearing models are declared explicitly so only their `ToJson` return type differs. ## Testing No behavior change; existing tests are adapted to the new return type. Full build and `ctest` pass locally.
| if (lhs == nullptr || rhs == nullptr) { | ||
| return lhs == rhs; | ||
| } | ||
| return *lhs == *rhs; |
There was a problem hiding this comment.
Just a reminder that Literal::operator<=> returns unordered when any side IsNull() returns true so two null defaults do not equal. I think we should fix Literal::operator<=> to be null safe?
There was a problem hiding this comment.
Or add a overload Literal::NullSafeEquals()
…-schema # Conflicts: # src/iceberg/catalog/rest/json_serde_internal.h # src/iceberg/json_serde.cc # src/iceberg/test/schema_json_test.cc
… comment Addresses review feedback: document that any zero-offset spelling (Z, +00:00, -00:00) is accepted by TemporalUtils::IsUtcOffset, and shorten the verbose comment on the default-value members.
A cross-type default (e.g. an int literal on a long field, accepted to match Java's cast behavior) was stored un-normalized, so projection materialized a wrong-typed fill value and a ToJson/FieldFromJson round-trip compared unequal (Int(34) vs Long(34)). Cast the default to the field type when constructing the field; values that already match are untouched and uncastable ones are left for Validate() to reject. Fixes the projection and round-trip-equality paths.
…::Make factory Move default-value normalization out of the constructor (which cannot report errors) into a fallible factory, per review feedback. SchemaField::Make casts a cross-type default to the field type and propagates the cast error / rejects out-of-range values instead of swallowing them; it is a no-op when the default already matches the field type. The constructor stores defaults verbatim, and ValidateDefault requires a stored default to match the field type exactly (Make is the cast-accepting path). This keeps the cross-type-accept behavior while ensuring stored defaults are always field-typed, so projection and JSON round-trip equality are correct.
Per review: replace the four default-value getters (optional<reference_wrapper> readers + _ptr sharers) with a single shared_ptr-returning getter per default, and remove the [[nodiscard]] attributes from SchemaField's getters.
Per review, a present-null default (Literal::Null) is no longer rejected: it is modeled as the absence of a default (matching Java), and dropped at construction so it is never stored. This also makes two such fields compare equal with no change to Literal::operator<=> (no null Literal is ever stored or compared), so the global comparison semantics used by expression predicates are left untouched.
…-null types, tests - LiteralFromJson accepts an integer JSON token for float/double defaults (Java writes a float default as a bare integer), so Java-produced v3 metadata parses in C++. - ValidateDefault rejects a non-null default on unknown/variant/geometry/geography fields (the spec requires those to default to null). - Tests: full TableMetadata-level v3 default round-trip; non-primitive + must-be-null default rejection; SchemaField equality distinguishes default values; present-null write-default dropped to absence; integer-encoded float default parses.
Revert the float/double LiteralFromJson leniency change and its test: that is the pre-existing shared literal parser, not code this PR introduces. The integer-encoded float-default interop gap (Java writes float defaults as bare ints) is a separate follow-up in a PR that owns the expression serde.
Per scope feedback, only drop [[nodiscard]] from this PR's own additions (the default getters). Restore it on the pre-existing field_id/name/type/optional/ToString/Equals; the broader [[nodiscard]] cleanup belongs in a separate PR.
The unknown/variant/geometry/geography branch reported "{type} fields must
default to null", which is ambiguous for the unknown type (reads as
"unrecognized fields") and awkwardly interpolates a parametrized type
(e.g. geometry(srid:...)) into a categorical sentence. Reword to
"type {} cannot have a default value" and align with the sibling
"Invalid {} value for {}: ..." messages.
SchemaField::Make and the file-local NormalizeDefault cast a cross-type default (e.g. an int literal on a long field) to the field type, but they have no production callers: defaults arrive already at the field type (FieldFromJson parses with the field type), and ValidateDefault's exact type match already rejects any cross-type default. The cast was therefore dead in production and, as wgtmac noted in review, allocated a new Literal even in the same-type case. Remove Make and NormalizeDefault (and their tests). The constructor stores defaults verbatim and Validate rejects type mismatches. The cast belongs in the follow-up that adds a real caller (e.g. schema evolution), where Java performs it in the NestedField constructor (castDefault).
wgtmac
left a comment
There was a problem hiding this comment.
Thanks for addressing all the feedback! Now this looks good to me.
One minor issue is that ValidateDefault now rejects any default whose literal type is not exactly the field type. That still diverges from Java: Types.NestedField.castDefault casts the provided literal to the field type and stores the converted literal. I don't think this is a blocker for now.
…s (2/4) When a column is present in the read schema but missing from a Parquet data file (written before the column existed), fill it with the column's v3 initial-default instead of null. Adds a shared Arrow materializer (arrow/literal_util) that turns a Literal into an Arrow scalar/array, and a kDefault projection branch in the Parquet schema/data projection paths. Part 2 of the v3 column-default-values work (POC apache#731), built on the schema support merged in apache#746.
…s (2/4) When a column is present in the read schema but missing from a Parquet data file (written before the column existed), fill it with the column's v3 initial-default instead of null. Adds a shared Arrow materializer (arrow/literal_util) that turns a Literal into an Arrow scalar/array, and a kDefault projection branch in the Parquet schema/data projection paths. The materializer wraps the storage array in the extension type for extension types such as `arrow.uuid` (compute::Cast has no storage->extension kernel). Part 2 of the v3 column-default-values work (POC apache#731), built on the schema support merged in apache#746.
…s (2/4) When a column is present in the read schema but missing from a Parquet data file (written before the column existed), fill it with the column's v3 initial-default instead of null. Adds a shared Arrow materializer (arrow/literal_util) that turns a Literal into an Arrow scalar/array, and a kDefault projection branch in the Parquet schema/data projection paths. The materializer wraps the storage array in the extension type for extension types such as `arrow.uuid` (compute::Cast has no storage->extension kernel). Part 2 of the v3 column-default-values work (POC apache#731), built on the schema support merged in apache#746.
…s (2/4) When a column is present in the read schema but missing from a Parquet data file (written before the column existed), fill it with the column's v3 initial-default instead of null. Adds a shared Arrow materializer (arrow/literal_util) that turns a Literal into an Arrow scalar/array, and a kDefault projection branch in the Parquet schema/data projection paths. The materializer wraps the storage array in the extension type for extension types such as `arrow.uuid` (compute::Cast has no storage->extension kernel). Part 2 of the v3 column-default-values work (POC apache#731), built on the schema support merged in apache#746.
…s (2/4) When a column is present in the read schema but missing from a Parquet data file (written before the column existed), fill it with the column's v3 initial-default instead of null. Adds a shared Arrow materializer (arrow/literal_util) that turns a Literal into an Arrow scalar/array, and a kDefault projection branch in the Parquet schema/data projection paths. The materializer wraps the storage array in the extension type for extension types such as `arrow.uuid` (compute::Cast has no storage->extension kernel). Part 2 of the v3 column-default-values work (POC apache#731), built on the schema support merged in apache#746.
…s (2/4) When a column is present in the read schema but missing from a Parquet data file (written before the column existed), fill it with the column's v3 initial-default instead of null. Adds a shared Arrow materializer (arrow/literal_util) that turns a Literal into an Arrow scalar/array, and a kDefault projection branch in the Parquet schema/data projection paths. The materializer wraps the storage array in the extension type for extension types such as `arrow.uuid` (compute::Cast has no storage->extension kernel). Part 2 of the v3 column-default-values work (POC apache#731), built on the schema support merged in apache#746.
AddColumn / AddRequiredColumn now accept an optional default value, used as both the initial-default and write-default of the new column; a non-null default also lets a required column be added (or an added column be made required) without AllowIncompatibleChanges(). UpdateColumnDefault sets or clears the write-default. Defaults are cast to the column type (rejecting uncastable or out-of-range values) and preserved across rename / doc / type updates and nested field-id reassignment. Part 4 of the v3 column-default-values work (POC apache#731), built on apache#746.
AddColumn / AddRequiredColumn now accept an optional default value, used as both the initial-default and write-default of the new column; a non-null default also lets a required column be added (or an added column be made required) without AllowIncompatibleChanges(). UpdateColumnDefault sets or clears the write-default. Defaults are cast to the column type (rejecting uncastable or out-of-range values) and preserved across rename / doc / type updates and nested field-id reassignment. Part 4 of the v3 column-default-values work (POC apache#731), built on apache#746.
AddColumn / AddRequiredColumn now accept an optional default value, used as both the initial-default and write-default of the new column; a non-null default also lets a required column be added (or an added column be made required) without AllowIncompatibleChanges(). UpdateColumnDefault sets or clears the write-default. Defaults are cast to the column type (rejecting uncastable or out-of-range values) and preserved across rename / doc / type updates and nested field-id reassignment. Part 4 of the v3 column-default-values work (POC apache#731), built on apache#746.
Part 1 of a multi-part split of #730 (column default values, item 2 of #637). The full
end-to-end implementation is in #731, kept open as the proof-of-concept; this series
lands it in reviewable pieces.
This PR is the schema foundation — representing, serializing and validating v3
column default values. It is purely additive and changes no read or write behavior on
its own.
What's in this PR
SchemaFieldcarriesinitial-default/write-default, stored asstd::shared_ptr<const Literal>(immutable payload shared across copies, like theadjacent
type_; the C++ analog of Java'sfinal Literal<?>). They are set via theconstructor. Getters return
std::optional<std::reference_wrapper<const Literal>>forreading (the
Schema::FindFieldByNameidiom);initial_default_ptr()/write_default_ptr()expose the shared pointer so a rebuilt field (e.g. IDreassignment) shares the value instead of copying it.
initial-default/write-defaultusing the existingsingle-value serialization (all primitive types).
Schema::Validate: version-gates theinitial-defaultto format v3(
kMinFormatVersionDefaultValues) — it reinterprets how existing data files are read,so it requires the v3 reader contract. The
write-defaultonly affects values writtengoing forward and is not version-gated (matching Java's
Schema.checkCompatibility,which gates only the initial default). Both defaults are otherwise validated to be
non-null primitive literals matching the field type.
initial-defaultmaps to
FieldProjection::Kind::kDefaultcarrying the literal (the per-format readersconsume this in the follow-up PRs).
Follow-ups (stacked on this PR)
literal_util+ parquet projection/materialization)UpdateSchemaadd/update column defaults)Testing
Added tests